FIX: Removed duplicate convolution for DoRA #2153

gslama12 · 2024-10-16T11:46:08Z

This pull request fixes two problems:

1. Duplicate Convolution in DoRA implementation for ConvNd Layers:
Since the base layer convolution is already computed in layer.py, we don't need to compute it again in the dora.py. Computing it again doubles the FLOPs consumption during the forward pass resulting in significantly higher FLOPs overall. We can pass the result from the base layer computed in layer.py to the forward pass of the _DoraConvNdLayer in dora.py and save computational resources.

2. Bugfix for DoRA regarding Convolutional Layers using the Groups Argument:
CNNs that for example use depthwise separable convolutional layers result in an error when applying DoRA. Adjusting the dimension of the conv_layer in layer.py fixes this issue.

BenjaminBossan · 2024-10-16T14:59:15Z

Thanks for this PR. We're aware of this potential inefficiency but I think it's not as easy as re-using the base result. The reasoning is explained here. Back then, we were only dealing with linear layers but I'm certain that the same logic applies to convolutional layers.

The good news is that this optimization is indeed possible if dropout is set to 0 or if we're in eval mode, see #2122. LMK what you think.

gslama12 · 2024-10-17T07:12:42Z

Thanks for the clarification! Would it be possible to apply the dropout for DoRA similar to how LoRA handles it, i.e. in lora_B(lora_A(x)) during the forward pass? It seems like LoRA also uses the pre-dropout information (result) here and does not re-compute the convolution: result = result + lora_B(lora_A(dropout(x))) * scaling (ln 1120 in layer.py)?

Also, what do you think of the fix for convolutional layers using the "groups" argument?

BenjaminBossan · 2024-10-17T13:15:59Z

Would it be possible to apply the dropout for DoRA similar to how LoRA handles it, i.e. in lora_B(lora_A(x)) during the forward pass? It seems like LoRA also uses the pre-dropout information (result) here and does not re-compute the convolution

Could you clarify what you mean here, maybe with a small code example? Note that we have to ensure that the implementation sticks with the specification of the original paper.

When we have no dropout, though, we should be able to make the same optimization as in #2122 though.

Also, what do you think of the fix for convolutional layers using the "groups" argument?

I wasn't aware of the groups argument in Conv2d. Quite possibly your solution is correct. Could you give a small example, so that we can build a unit test based on that? Also, could you explain why we only need it for DoRA?

gslama12 · 2024-10-19T12:52:50Z

So the reasoning behind why the DoRA optimization is not possible when we use lora_dropout != 0 was, that the magnitude vector in the code snippet below will receive with pre-dropout information in form of the base layer result.

x = dropout(x)
result = result + self.lora_magnitude_vector[active_adapter](
                        x,
                        lora_A=lora_A,
                        lora_B=lora_B,
                        scaling=scaling,
                        base_layer=self.get_base_layer(),
                        base_layer_result=result
                    )

After looking at the DoRA paper, I can't figure out why this is an issue. If we see DoRA as a magnitude * direction, where LoRA is applied to the direction component, shouldn't we be able to apply LoRA to the direction component using lora_B(lora_A(dropout(x)))? Basically computing the DoRA result as: result_dora = (mag_norm_scale - 1) * base_layer_result + mag_norm_scale * lora_B(lora_A(dropout(x))) * scaling in dora.py.

Based on the paper, why would we need to compute the full-rank convolution again with a dropout(x) as the input (dora.py, ln 161)?

BenjaminBossan · 2024-10-21T09:52:00Z

Note that when we have LoRA+DoRA+dropout, we ensure that dropout is consistently applied to the LoRA part and the "base_result" part. If we use the result from the base layer directly (i.e. base_layer_result in your code), the LoRA-dropout is not applied to it, therefore, the result differs.

gslama12 · 2024-10-21T14:23:00Z

I think I understand your point. But if we look at LoRA (e.g. ln 1120 in layer.py) we see, that we also don't apply the lora_dropout to the result here:

result = self.base_layer(x, *args, **kwargs)
.
.
.
if not self.use_dora[active_adapter]:
    result = result + lora_B(lora_A(dropout(x))) * scaling

So my question is wether the "base_result" part even needs the dropout for DoRA. And if we do need the dropout in the "base_result" part, why do we not need it for LoRA?

BenjaminBossan · 2024-10-22T09:35:38Z

Exactly. The result from the base model does not contain any LoRA-dropout applied to the x. However, for the DoRA-part, we need to ensure that the same dropout is applied to the x of base_layer_result as for the LoRA part. In the existing code, we apply dropout to x, then pass x to the DoRA layer, where x is passed to the convolution operation. Therefore, this requirement is met.

In your proposed code, we would instead use the result from the base layer which does not include x with dropout. Therefore, this is a different output and the DoRA calculation would no longer correspond to what is described in the paper.

Only if there is no dropout can we re-use the base result in the way you proposed, as is shown in #2122. (In addition, we also need to take care of potentially existing bias terms, but I left that out to simplify the discussion.)

You can also create a LoRA layer with DoRA and check that the outputs differ when dropout is applied between the old code and the suggested change (fix the seed to ensure that there is no randomness involved).

gslama12 · 2024-10-22T15:48:10Z

Yes, I understand the reasoning and that my suggestion would produce a different output. I just could not find the reasoning for why the dropout needs to be applied like this in the DoRA paper. But I'll assume you're right.

The reason why I am questioning this is because we do not seem to use the dropout in the same way for LoRA. So my question is:

Why do we not apply the dropout to the result for the base layer for LoRA in the code example?

result = self.base_layer(x, *args, **kwargs) 
.
.
.
if not self.use_dora[active_adapter]:
    result = result + lora_B(lora_A(dropout(x))) * scaling   # why can we use the base_layer result without dropout here?

BenjaminBossan · 2024-10-23T13:15:53Z

Okay, I understand now. Our calculation for DoRA is a bit more complicated than if we simply followed the equation from the paper. The reason is that we first calculate the base result from the base model. Thus we need to correct it later, which is why we have the (mag_norm_scale - 1) term, which cannot be directly seen in the paper. I agree that it's not immediately obvious why for that part, we require the dropout being applied.

Trying to fit this into an equation, this is what I get (note that I omitted the scale for simplicity):

If we did not have the dropout in the base result part, then the first 2 terms in the final equation, W x and W drop(x) would cancel out, which looks correct. However, it would also mean we would not be able to simplify the last term, (W + B A) drop(x), which looks incorrect.

Not sure if @nbasyl would have time to check this.

gslama12 · 2024-10-23T17:58:15Z

Right! And if we look at the implementation of LoRA (ln 1121 in layer.py), we see that there we also can't simplify the term W x + b + (B A) drop(x) to (W B A) drop(x) + b.

Since the dropout we are talking about here is actually a lora_dropout, I feel like it would be more intuitive, if it was only applied to the LoRA part of the DoRA implementation (i.e. only to the term m/W_n B A drop (x)). Also it would seem strange to have this non-zero difference W x - W drop(x) in the final result equation if we set lora_dropout != 0.

gslama12 · 2024-10-30T15:12:36Z

Any updates on this?

BenjaminBossan · 2024-10-30T15:49:59Z

Not sure if Shih-Yang currently has time to look into this. In the meantime, how about opening a separate PR for the groups issue. We can also open a separate PR along the lines of #2122, since this optimization should definitely work for dropout=0 or training=False.

github-actions · 2024-11-29T15:04:02Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

BenjaminBossan · 2024-11-29T15:17:11Z

not stale

nbasyl · 2024-12-18T12:38:40Z

Hi @gslama12 @BenjaminBossan, apologies for the delayed response—I was completely swamped with schoolwork last month. My initial intuition regarding the DoRA implementation was to ensure that dropout(x) applies "only" to all trainable parameters, including not just the LoRA components but also the magnitude vector m. This is why I wanted to keep the base result intact.

However, I now see your point that the non-zero difference in W x - W drop(x) in the final result equation, when lora_dropout != 0, is indeed a bit unintended. I hadn’t noticed that issue until you pointed it out. To address this, I think we could modify the DoRA implementation so that the magnitude vector in the first term of DoRA is detached from the gradient graph, ensuring it’s unaffected by dropout. For example, we could try something like this:

We'd still need to confirm whether this adjustment improves or worsens performance by actually running experiments. That said, I’d recommend sticking to dropout = 0 for now. From my experience—and based on insights from others, like Answer.ai—setting dropout to zero works perfectly fine with DoRA. I am working on potentially a DoRA V2 and I will make sure to conduct experiments regarding this matter and I’ll keep you guys updated if I come across anything interesting to share. In the mean time, I think let's just stick with the original DoRA implementation which has been widely tested and verified.

BenjaminBossan · 2024-12-18T17:13:51Z

Thanks a lot for your comment @nbasyl, what you say makes sense. I agree that we should not change the current implementation, as it would break reproducibility of training DoRA adapters. Instead, as you suggested, when there is an improvement, we can later implement it as a new adapter or as a new option for LoRA/DoRA.

@gslama12 Would you be interested in implementing the optimization for the non-dropout case for convolution layers, similar to #2122?

gslama12 · 2024-12-19T11:50:57Z

@nbasyl Thanks for the reply!
@BenjaminBossan Yes i can change optimization similar to #2122, such that it is only used in the case where dropout=0.

BenjaminBossan · 2025-01-10T16:07:55Z

@gslama12 Are you still interested in working on this? Also, the other issue reported of PEFT not honoring the groups argument is still outstanding, do you plan to create a PR for that?

SlamanigG · 2025-01-10T18:20:36Z

@BenjaminBossan Yes! I am currently quite swamped with schoolwork but I will surely get to it in February if that is fine for you. I will also look into the groups issue then and see what can be done there.

github-actions · 2025-02-04T15:04:03Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

BenjaminBossan · 2025-02-04T15:09:13Z

not stale

gslama12 · 2025-02-08T19:45:05Z

Reopened as #2371

BenjaminBossan · 2025-02-11T14:32:30Z

The DoRA optimization for conv layers is now merged. Please LMK if you still plan on working on adding the groups argument.

gslama12 · 2025-02-11T14:54:58Z

Yes I will think about the groups fix, but it might be slightly more complex as we need to also consider this when merging the weights.

BenjaminBossan · 2025-02-11T16:06:17Z

Yes I will think about the groups fix, but it might be slightly more complex as we need to also consider this when merging the weights.

Good point. In case that it's not easily solved, we can still support it, raise an error when trying to merge, and probably also add a warning at init time that the model using groups means users won't be able to merge later on. That way, users who don't need merging can still benefit from the feature.

gslama12 · 2025-02-11T18:24:07Z

Ok that sounds like a good plan. I will let you know when i have worked something out.

BenjaminBossan · 2025-02-12T12:11:06Z

@gslama12 I did some experiments, I wonder if it is not actually quite simple. Here is the diff I came up with:

modified   src/peft/tuners/lora/layer.py
@@ -1058,10 +1058,16 @@ class _ConvNd(nn.Module, LoraLayer):
         kernel_size = base_layer.kernel_size
         stride = base_layer.stride
         padding = base_layer.padding
+        groups = base_layer.groups
+        if (self.in_features % groups != 0) or (self.out_features % groups != 0):
+            raise ValueError("TODO this should not happen")
+        if r % groups != 0:
+            raise ValueError("TODO rank must be divisible by groups")
+
         conv_layer = type(base_layer)
         out_kernel = out_stride = (1,) * (self._kernel_dim - 2)
-        self.lora_A[adapter_name] = conv_layer(self.in_features, r, kernel_size, stride, padding, bias=False)
-        self.lora_B[adapter_name] = conv_layer(r, self.out_features, out_kernel, out_stride, bias=lora_bias)
+        self.lora_A[adapter_name] = conv_layer(self.in_features, r, kernel_size, stride, padding, bias=False, groups=groups)
+        self.lora_B[adapter_name] = conv_layer(r, self.out_features, out_kernel, out_stride, bias=lora_bias, groups=groups)
         self.lora_bias[adapter_name] = lora_bias
 
         if use_rslora:
@@ -1231,14 +1237,17 @@ class _ConvNd(nn.Module, LoraLayer):
 
         weight_A = self.lora_A[adapter].weight
         weight_B = self.lora_B[adapter].weight
+        base_layer = self.get_base_layer()
 
         if cast_to_fp32:
             weight_A = weight_A.float()
             weight_B = weight_B.float()
 
         # https://github.com/bmaltais/kohya_ss/blob/feb6728762a8f463d15ba936d189d4c3abfaa1ab/networks/lora.py#L117
-        if self.get_base_layer().weight.size()[2:4] == (1, 1):
-            # conv2d 1x1
+        is_1x1 = base_layer.weight.size()[2:] == (1,) * (self._kernel_dim - 2)
+        if is_1x1:
+            if base_layer.groups != 1:
+                raise ValueError("TODO")
             output_tensor = (weight_B.squeeze(3).squeeze(2) @ weight_A.squeeze(3).squeeze(2)).unsqueeze(2).unsqueeze(
                 3
             ) * self.scaling[adapter]
@@ -1247,6 +1256,7 @@ class _ConvNd(nn.Module, LoraLayer):
                 self.conv_fn(
                     weight_A.transpose(0, 1),
                     weight_B,
+                    groups=base_layer.groups,
                 ).transpose(0, 1)
                 * self.scaling[adapter]
             )

It's just quick and dirty, but in my initial testing it seems to work. Unsolved so far is the 1x1 conv part, since we're not using the self.conv_fn there. But I wonder if we can't just make the switch and avoid having to re-implement the groups logic.

Overall, I wonder how useful it is to apply LoRA to conv layers with groups, since they are already quite parameter efficient. But it shouldn't hurt to include it.

gslama12 · 2025-02-12T17:32:07Z

I actually tried to reproduce the error that I originally got for DoRA when using conv layers with groups and to my suprise I couldn't. There might be something that has changed since, as it seems to work fine with the current implementation.

Regarding your last point: Yes conv layers that use groups are quite efficient during inference but during training lora can still reduce FLOPs and memory by quite a bit.

BenjaminBossan · 2025-02-12T18:04:23Z

I actually tried to reproduce the error that I originally got for DoRA when using conv layers with groups and to my suprise I couldn't. There might be something that has changed since, as it seems to work fine with the current implementation.

Is that including merging? Also, just because there is no error does not mean it works correctly.

Regarding your last point: Yes conv layers that use groups are quite efficient during inference but during training lora can still reduce FLOPs and memory by quite a bit.

I didn't do the math, but it's good to know it can still be useful.

gslama12 · 2025-02-17T19:01:23Z

Yes, also no error for merging. But of course, as you stated, this does not mean that it works correctly.

The fix I had previously was basically changing the update_layer function in layer.py ln907 to this (I am however not sure if this is the right way to go):

self.lora_B[adapter_name] = conv_layer(r, int(self.out_features / self.base_layer.groups), out_kernel, out_stride, bias=False)

BenjaminBossan · 2025-02-18T10:24:31Z

I had indeed also tried to apply groups only to lora_B but ran into issues after adding unit tests (don't remember which), hence the solution I came up with. Whatever the correct way is, we would need to be careful to ensure that it works as intended.

gslama12 · 2025-02-27T07:32:05Z

I just revisited this issue and I think you suggestion from here is actually something different to what I intended. In your approach you actually transfer the groups argument to the lora layers, making the adapter also a conv layer with groups. I think the adapter does not need to have a groups argument, because it will have very few trainable parameters anyway, meaning we do not need the additional efficiency that the groups argument can provide here.

Instead my idea was to fix the dimensions of the lora_A and lora_B weights, such that they do not cause issues when the base_layer has groups. I think taking the approach from the Microsoft LoRA implementation is a safe bet.

I have opened this in a separate PR in #2403.

* add argument for dropout * increase default lr * change default lr in examples * fix bug in calculation of KL batch size * KL batch size should be args.per_device_train_batch_size * Update kto_trainer.mdx with hparam recs * typo * allow dropout to be disabled * update lr in sample scrippt * Update kto_config.py * Update trl/trainer/kto_trainer.py * Update docs/source/kto_trainer.mdx --------- Co-authored-by: Kashif Rasul <[email protected]>

BenjaminBossan mentioned this pull request Oct 16, 2024

Optimize DoRA in eval and no dropout #2122

Merged

gslama12 changed the title ~~FIX: Removed duplicate convolution for DoRA & fixed error for ConvNd layers using "groups"~~ FIX: Removed duplicate convolution for DoRA" Nov 5, 2024

gslama12 changed the title ~~FIX: Removed duplicate convolution for DoRA"~~ FIX: Removed duplicate convolution for DoRA Nov 5, 2024

BenjaminBossan mentioned this pull request Jan 10, 2025

Issue merging a Lora model to a SANA transformer #2318

Closed

4 tasks

gslama12 closed this Feb 8, 2025

gslama12 force-pushed the main branch from 49d9bd7 to 40fe166 Compare February 8, 2025 19:16

gslama12 mentioned this pull request Feb 8, 2025

Optimization for ConvNd if dropout=0. #2371

Merged

gslama12 mentioned this pull request Feb 27, 2025

FIX for ConvNd layers using the groups argument. #2403

Merged

FIX: Removed duplicate convolution for DoRA #2153

FIX: Removed duplicate convolution for DoRA #2153

Uh oh!

Conversation

gslama12 commented Oct 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BenjaminBossan commented Oct 16, 2024

Uh oh!

gslama12 commented Oct 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BenjaminBossan commented Oct 17, 2024

Uh oh!

gslama12 commented Oct 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BenjaminBossan commented Oct 21, 2024

Uh oh!

gslama12 commented Oct 21, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BenjaminBossan commented Oct 22, 2024

Uh oh!

gslama12 commented Oct 22, 2024

Uh oh!

BenjaminBossan commented Oct 23, 2024

Uh oh!

gslama12 commented Oct 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gslama12 commented Oct 30, 2024

Uh oh!

BenjaminBossan commented Oct 30, 2024

Uh oh!

github-actions bot commented Nov 29, 2024

Uh oh!

BenjaminBossan commented Nov 29, 2024

Uh oh!

nbasyl commented Dec 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BenjaminBossan commented Dec 18, 2024

Uh oh!

gslama12 commented Dec 19, 2024

Uh oh!

BenjaminBossan commented Jan 10, 2025

Uh oh!

SlamanigG commented Jan 10, 2025

Uh oh!

github-actions bot commented Feb 4, 2025

Uh oh!

BenjaminBossan commented Feb 4, 2025

Uh oh!

gslama12 commented Feb 8, 2025

Uh oh!

BenjaminBossan commented Feb 11, 2025

Uh oh!

gslama12 commented Feb 11, 2025

Uh oh!

BenjaminBossan commented Feb 11, 2025

Uh oh!

gslama12 commented Feb 11, 2025

Uh oh!

BenjaminBossan commented Feb 12, 2025

Uh oh!

gslama12 commented Feb 12, 2025

Uh oh!

BenjaminBossan commented Feb 12, 2025

Uh oh!

gslama12 commented Feb 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BenjaminBossan commented Feb 18, 2025

Uh oh!

gslama12 commented Feb 27, 2025

Uh oh!

Reviewers

Assignees

Labels

gslama12 commented Oct 16, 2024 •

edited

Loading

gslama12 commented Oct 17, 2024 •

edited

Loading

gslama12 commented Oct 19, 2024 •

edited

Loading

gslama12 commented Oct 21, 2024 •

edited

Loading

gslama12 commented Oct 23, 2024 •

edited

Loading

nbasyl commented Dec 18, 2024 •

edited

Loading

gslama12 commented Feb 17, 2025 •

edited

Loading